RepresentationSeries: pca, nmf, tsne #140

Conversation

henrifroese

  • Implement full support for RepresentationSeries in the "Dimensionality Reduction" functions of the representation module
  • Add appropriate (parameterized) tests in `test_representation.py`

Note: the dimensionality reduction functions do support both flat and representational input. This is because, as discussed in #134, we will not implement a `to_repr` function, so users might have e.g. a flattened tfidf Series they want to perform pca on -> we allow them to use the dimensionality reduction functions. Of course, we could remove the flat support (if we basically want to "forbid" users from using `VectorSeries` for tfidf/term_freq/count); we're not 100% sure.

…ion" funcionts of representaion module

- Implement full support for Representation Series in "Dimensionality Reduction" functions of representation module
- add appropriate parameterized tests in `test_representation.py`

Note: the dimensionality reduction functions *do support* both flat and represenational input; this is because as discussed in jbesomi#134 we will not implement a `to_repr` function, so users might have e.g. a flattened tfidf series they want to perform pca on -> they can use the function. Of course, we could remove the flat support (if we basically want to "forbid" users from using `VectorSeries` for tfidf/term_freq/count; we're not 100% sure.

Co-authored-by: Maximilian Krahn <[email protected]>

jbesomi commented Aug 5, 2020

🎉

Opinions:

  1. I would stick to the rule "One function - one single input HeroSeries type"

Review:

  1. The output of both pca and nmf makes sense as a VectorSeries.
  2. pca, by its algorithm definition, requires a dense array as input (as the data are normalized). The input, in this case, should be a VectorSeries.
    --> We will need to explain clearly in the documentation that it is strongly advised to limit the length of the vector (i.e. the number of features in tfidf).
    --> Pseudocode: `s.tfidf(max_features=200).flatten().pca()` ...
    --> We need to check that the input is a VectorSeries. If it has a MultiIndex, we need to explain that the Series should be flattened first (see the sketch after this list).
  3. nmf works with a sparse matrix. In this case, it would be inefficient not to work with a RepresentationSeries as input.
    --> In getting started/representation we will need to explain the difference between the two approaches and which approach (pca or nmf) is preferred and when.
  4. tsne: TODO, not sure if it uses dense or sparse input ...
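
For illustration, a minimal sketch of that input check (the helper name and error message are assumptions, not texthero's actual code):

```python
import pandas as pd

def check_vector_series(s: pd.Series) -> None:
    # Hypothetical guard: a RepresentationSeries carries a MultiIndex,
    # so a MultiIndex here means the user still needs to flatten.
    if isinstance(s.index, pd.MultiIndex):
        raise TypeError(
            "pca expects a VectorSeries; this Series has a MultiIndex. "
            "Flatten it first, e.g. s.pipe(hero.flatten)."
        )
```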

Hope you got some insights! 👍

@henrifroese changed the title from "RepresentationSeries: pca, nmf" to "RepresentationSeries: pca, nmf, tsne" on Aug 6, 2020

henrifroese commented Aug 6, 2020

I believe that it is a bad idea to restrict the pca input to VectorSeries. Users will almost always first apply one of count, term_frequency, tfidf, so they have a RepresentationSeries of their documents. They then want to do dimensionality reduction. If pca takes VectorSeries input, and nmf, tsne take RepresentationSeries input, we have very non-user-friendly behaviour that goes against the goal of texthero:

df["pca"] = df["texts"].pipe(hero.tfidf).pipe(hero.flatten).pipe(hero.pca)
df["nmf"] = df["texts"].pipe(hero.tfidf).pipe(hero.nmf)

So two functions that both do dimensionality reduction need a different pipeline, that's not nice for users!

Yes, pca in sklearn does not handle sparse input internally. That is why, at line 492 in the code, we do: `s_for_vectorization = s_coo_matrix.todense()  # PCA cannot handle sparse input.`
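
For context, a minimal runnable sketch of that densification step (the RepresentationSeries layout and the `s_coo_matrix` / `s_for_vectorization` names mirror the quoted line; the rest is illustrative):

```python
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.decomposition import PCA

# Toy RepresentationSeries: MultiIndex of (document, term) -> weight.
s = pd.Series(
    [0.5, 0.8, 0.3, 0.9],
    index=pd.MultiIndex.from_tuples([(0, "cat"), (0, "dog"), (1, "cat"), (1, "fish")]),
)

rows = s.index.get_level_values(0).to_numpy()
cols, _ = pd.factorize(s.index.get_level_values(1))
s_coo_matrix = coo_matrix((s.to_numpy(), (rows, cols)))

# PCA cannot handle sparse input, so densify first
# (.toarray() here; the quoted line uses .todense()).
s_for_vectorization = s_coo_matrix.toarray()
values = PCA(n_components=2).fit_transform(s_for_vectorization)
```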

This allows the user to call all dimensionality reduction functions the same way. This is why I see two possibilities:

  1. Allow both RepresentationSeries and VectorSeries input for dimensionality reduction functions (but I understand you don't want to do this)
  2. Only allow RepresentationSeries input

I think it should stay an implementation detail and not be visible to users that pca cannot handle the sparseness internally (that's why we issue a warning if the input's dimensionality is too high; we can also add a comment about that in a tutorial). It's much nicer if the dim. red. functions are all called the same way. What do you think?


jbesomi commented Aug 6, 2020

Hey Henri, I totally agree with your view and understand your points!

I agree with having all of the dimensionality reduction functions (pca, nmf, tsne) receive a RepresentationSeries as input.

> I think it should stay an implementation detail and not be visible to users that pca cannot handle the sparseness internally (that's why we issue a warning if the input's dimensionality is too high; we can also add a comment about that in a tutorial). It's much nicer if the dim. red. functions are all called the same way. What do you think?

I believe the users should be very aware that pca cannot handle sparse input. We will have to make this very clear in the docstring (maybe citing an external source explaining why, or eventually we will write a blog post on that): pca will densify the Series. Going further, we will mention that the number of features should be limited (by saying to set max_features=X), and we will encourage using nmf in case the number of features is very large. One of Texthero's goals is also to teach new users, and having a rough idea of the advantages/disadvantages of the different dim. red. algorithms is for sure important.

Regarding the max_features default: an alternative might be to set the default max_features argument of the representation functions (tfidf, term_freq, ...) to 100, 200, or 300. Opinions? I don't have a clear position on that, but generally I like the idea of keeping (when it makes sense and is feasible) the same defaults as scikit-learn.
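
To make the trade-off concrete, a small scikit-learn-only sketch (toy corpus; 200 is one of the hypothetical values above, and scikit-learn's own default for max_features is None, i.e. an unbounded vocabulary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the fish swam"]

unbounded = TfidfVectorizer()               # scikit-learn default: max_features=None
capped = TfidfVectorizer(max_features=200)  # the 100/200/300 alternative above

print(unbounded.fit_transform(corpus).shape)  # (3, full vocabulary size)
print(capped.fit_transform(corpus).shape)     # vocabulary capped at 200 terms
```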

@jbesomi marked this pull request as draft on August 7, 2020
@henrifroese

> I believe the users should be very aware that pca cannot handle sparse input. We will have to make this very clear in the docstring (maybe citing an external source explaining why, or eventually we will write a blog post on that): pca will densify the Series. Going further, we will mention that the number of features should be limited (by saying to set max_features=X), and we will encourage using nmf in case the number of features is very large. One of Texthero's goals is also to teach new users, and having a rough idea of the advantages/disadvantages of the different dim. red. algorithms is for sure important.

I will add something about that to the docstring! I agree that when writing a full tutorial on representation, we have to describe all this in detail; otherwise users will just use pca and wonder why it crashes their colab 🙃 .

> Regarding the max_features default: an alternative might be to set the default max_features argument of the representation functions (tfidf, term_freq, ...) to 100, 200, or 300. Opinions? I don't have a clear position on that, but generally I like the idea of keeping (when it makes sense and is feasible) the same defaults as scikit-learn.

I agree it's good to stick to the scikit-learn defaults and mention this in the tutorial (the representation tutorial will need to be very good 🤓 !)

- dim. red. functions only support Representation Series input
- appropriately change tests

henrifroese commented Aug 7, 2020

Just removed support for VectorSeries input in the dimensionality reduction functions. Should be ready to merge now.

QUESTION: About the clustering functions: should they:

  • (a) only support VectorSeries input (pro: they will probably mostly get called after dimensionality reduction -> VectorSeries as input)
  • (b) support both VectorSeries and RepresentationSeries input (pro: some users might want to call them straight after tfidf etc.; dbscan and kmeans also directly support sparse input, so users can efficiently work straight with a RepresentationSeries; see the sketch below)
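
For option (b), a quick scikit-learn-only check that clustering accepts sparse input directly (toy data, not texthero code):

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# A tiny sparse document-term matrix: no densification needed for clustering.
X = csr_matrix([[1.0, 0.9, 0.0], [0.8, 1.0, 0.0], [0.0, 0.1, 1.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1]: the first two rows cluster together
```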


jbesomi commented Aug 8, 2020

Good question!

(a) Only VectorSeries: would be very inefficient in some cases ...

(b) We would not respect the rule "one function type for s" ... (but c'est la vie ... !)
--> And what if the dim. red. functions returned only RepresentationSeries and the clustering functions accepted only RepresentationSeries? How do you see this option? (Users would then use flatten to store the Series in the DataFrame.) The problem then would be if a user wants to re-extract a Series from a pandas DataFrame; in that case, they would receive a VectorSeries that would not work in any clustering function ... not the best idea ...

So, to recap right now we have:

- embedding functions (tfidf, count, ...): input TokenSeries, output RepresentationSeries
- dim. red. functions (pca, nmf, tsne): input RepresentationSeries, output VectorSeries
- clustering functions: input RepresentationSeries or VectorSeries, output ?

This might be a bit confusing/overcomplicated/limited, what's your advice on that?

An alternative might be: keep only RepresentationSeries and, if needed, use flatten to save into a DataFrame; but then can we extract a RepresentationSeries again?

Conclusion: I would like to see/understand your opinion! 👍


henrifroese commented Aug 9, 2020

I agree it's really annoying that we can't just only work with RepresentationSeries as most users will probably use DataFrames.

> An alternative might be: keep only RepresentationSeries and, if needed, use flatten to save into a DataFrame; but then can we extract a RepresentationSeries again?

I think even if that's possible, it's not really nice UX if users have to use `flatten` and `to_repr` often.

I think the possibilities are:

  1. Fully switch to RepresentationSeries, even for non-sparse output like that of dimensionality reduction and embed. This, however, destroys DataFrame compatibility.
  2. Forget the "one function, one input" rule (even though it's useful), and:
    1. let count, tfidf, term_frequency return RepresentationSeries
    2. let embed, pca, nmf, tsne return VectorSeries (as their output is never sparse)
    3. let pca, nmf, tsne, and all clustering functions handle both RepresentationSeries and VectorSeries input

I believe that possibility 2 is the most intuitive for users, as the only time they'll ever notice it is when trying to put a Series from count, tfidf, or term_frequency into a DataFrame; we'll clearly explain that they'll have to use flatten (see the sketch below).
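
To make that concrete, a hedged sketch of the UX under possibility 2 (function names follow the design discussed in this thread, including the flatten step; this is not a released texthero interface):

```python
import pandas as pd
import texthero as hero

df = pd.DataFrame({"texts": ["first document", "second document"]})
tokens = df["texts"].pipe(hero.clean).pipe(hero.tokenize)  # TokenSeries

# count/tfidf/term_frequency return a (sparse) RepresentationSeries,
# so it must be flattened before it fits into the DataFrame.
df["tfidf"] = tokens.pipe(hero.tfidf).pipe(hero.flatten)

# pca returns a dense VectorSeries, which fits into the DataFrame directly.
df["pca"] = tokens.pipe(hero.tfidf).pipe(hero.pca)
```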

@henrifroese

Closing this, see #156
